Netflix Data Analysis

Data Preprocessing

(Cleaning data)

1. Data Types

Type

Year Released/Added

Rating

Duration
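The four fields above can be converted with pandas roughly as follows. This is a minimal sketch, not the notebook's own code: the column names (`type`, `date_added`, `release_year`, `rating`, `duration`) and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample; column names are assumptions, not taken from the notebook.
raw = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "date_added": ["September 9, 2019", "January 1, 2018", None],
    "release_year": [2016, 2017, 2010],
    "rating": ["TV-MA", "TV-14", "PG"],
    "duration": ["90 min", "2 Seasons", "123 min"],
})

def convert_types(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Low-cardinality text columns fit the category dtype.
    out["type"] = out["type"].astype("category")
    out["rating"] = out["rating"].astype("category")
    # Missing or unparseable dates become NaT; Int64 keeps missing years as <NA>.
    added = pd.to_datetime(out["date_added"], errors="coerce")
    out["year_added"] = added.dt.year.astype("Int64")
    out["year_released"] = out["release_year"].astype("Int64")
    # Keep only the leading number of "90 min" / "2 Seasons".
    number = out["duration"].str.extract(r"(\d+)", expand=False)
    out["duration"] = pd.to_numeric(number).astype("Int64")
    return out.drop(columns=["date_added", "release_year"])

typed = convert_types(raw)
print(typed.dtypes)
```

Note that the numeric `duration` mixes minutes (movies) and seasons (TV shows), so it only makes sense to compare it within one content type.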

2. Missing data

Notice:

* A large share of the director and cast values is missing
* Country has some missing values
* A small portion of year-added is missing

Question: What percentage of the values in each column is NaN?
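One way to answer this with pandas is to take the mean of the boolean `isna()` mask per column. The sketch below uses a small made-up frame; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample with deliberate gaps; column names are assumptions.
df = pd.DataFrame({
    "director": ["A", np.nan, np.nan, np.nan],
    "country": ["United States", np.nan, "India", "Japan"],
    "year_added": [2019, 2018, 2017, 2016],
})

# isna() gives True/False per cell; the column mean of that mask is the NaN fraction.
nan_percent = (df.isna().mean() * 100).sort_values(ascending=False)
print(nan_percent.round(1))
```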

Action(s):

1. Drop director and cast: a large share of their values is missing and they are not needed for this analysis (they might be needed for a recommender system)
2. Drop description because its content is too complex to analyze here
3. Fill missing country values with US
4. Drop all remaining rows with missing data, since they are few and have little impact
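The four actions above could be sketched as a single helper. The function name `clean`, its parameters, and the column names are hypothetical; the literal used for the US (`"United States"` here) is an assumption and should match however the dataset spells it.

```python
import numpy as np
import pandas as pd

def clean(df, drop_cols=("director", "cast", "description"),
          default_country="United States"):
    # Actions 1 and 2: drop the columns that will not be analyzed.
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Action 3: fill missing countries with the default (assumed spelling).
    out["country"] = out["country"].fillna(default_country)
    # Action 4: drop the few rows that still contain missing values.
    return out.dropna().reset_index(drop=True)

# Made-up sample to exercise the helper.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": [np.nan, "X", np.nan],
    "cast": [np.nan, np.nan, "Y"],
    "description": ["...", "...", "..."],
    "country": ["India", np.nan, "Japan"],
    "year_added": [2019.0, 2018.0, np.nan],
})
cleaned = clean(sample)
print(cleaned)
```

Filling every missing country with one value will inflate that country's counts slightly, which is worth keeping in mind when reading the per-country results.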

Exploratory Data Analysis

(Using the processed data to better understand Netflix's content and the trends in movies and TV shows)

1. Understanding the popularity of movies and TV shows on Netflix in different countries

Notice:

* About 2/3 of Netflix's content is movies and 1/3 is TV shows

Question: How is the content distributed across countries?

Notice:

* Some entries list a combination of several countries

Notice:

* The US accounts for more than 50% of the content in the top 10 countries
* The US, India, and the UK together contribute about 75% of the content in the top 10
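Shares like the ones noted above can be computed from a frequency table. This is a minimal sketch on made-up rows; `country` is an assumed column name and `top_country_share` is a hypothetical helper.

```python
import pandas as pd

# Made-up sample; "country" is an assumed column name.
df = pd.DataFrame({
    "country": ["United States"] * 6 + ["India"] * 2
               + ["United Kingdom"] + ["Japan"],
})

def top_country_share(frame, n=10):
    # Count titles per country, keep the n largest, and express each as a
    # percentage of the top-n total (not of the whole catalogue).
    counts = frame["country"].value_counts().head(n)
    return (counts / counts.sum() * 100).round(1)

share = top_country_share(df)
print(share)
```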

Question: How is each type of content distributed in the top 10 countries?

(ignoring entries that list a combination of several countries)

Notice:

* The number of movies is about 2 times the number of TV shows in the US, Canada, Spain, Germany, and Mexico
* In India, the number of movies is about 8 times the number of TV shows
* The two numbers are approximately equal in the UK and France
* The pattern reverses in Japan and South Korea: the number of movies is about 1/3 of the number of TV shows
* Entries that list several countries still need to be considered
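A cross-tabulation gives the movie and TV show counts per country, and splitting the comma-separated country lists is one possible way to take multi-country entries into account. The sketch below is an illustration on made-up rows; the column names and the helper `type_counts_by_country` are assumptions.

```python
import pandas as pd

# Made-up sample; some rows list several countries, as in the notes above.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "country": ["United States", "United States, India", "United States",
                "India", "Japan", "Japan, United States"],
})

def type_counts_by_country(frame, split_lists=False):
    data = frame[["type", "country"]].copy()
    if split_lists:
        # One row per (title, country): a co-production counts once per country.
        data["country"] = data["country"].str.split(",")
        data = data.explode("country")
        data["country"] = data["country"].str.strip()
    else:
        # Ignore entries that combine several countries.
        data = data[~data["country"].str.contains(",")]
    return pd.crosstab(data["country"], data["type"])

single = type_counts_by_country(df)
exploded = type_counts_by_country(df, split_lists=True)
print(single)
print(exploded)
```

With `split_lists=True` a title is counted once for every country it lists, so the column totals exceed the number of titles.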

2. Exploring which type of content Netflix has focused on in recent years

Action(s):

1. Do the same for nf_movie --> call df as movie
2. Do the same for nf_tv --> call df as tv
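The step being repeated for `nf_movie` and `nf_tv` is not shown here; one plausible reading is a count of titles per `year_added`. The sketch below follows that reading on made-up rows. How `nf_movie` and `nf_tv` were built (filtering on a `type` column) and the helper `added_per_year` are assumptions.

```python
import pandas as pd

# Made-up sample; column names are assumptions.
nf = pd.DataFrame({
    "type": ["Movie"] * 5 + ["TV Show"] * 3,
    "year_added": [2017, 2018, 2018, 2019, 2019, 2018, 2019, 2019],
})
nf_movie = nf[nf["type"] == "Movie"]
nf_tv = nf[nf["type"] == "TV Show"]

def added_per_year(frame):
    # Number of titles added in each year, as a two-column DataFrame.
    return frame.groupby("year_added").size().rename("count").reset_index()

movie = added_per_year(nf_movie)
tv = added_per_year(nf_tv)
print(movie)
print(tv)
```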

Notice:

* The growth in content started in 2013
* The number of movies has grown much faster than the number of TV shows on Netflix --> Netflix is focusing on movies
* More than 1200 new movies were added in each of 2018 and 2019
* The data was collected as of 2019, so the 2020 figures are incomplete and misleading here.

Question: Does Netflix add content right after it is released? (Only data after 2000 is considered.)

(Check whether there is a relationship between year_added and year_released)
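A simple way to check this is to look at the gap between the two years and at their correlation. This is a sketch on made-up values; whether "after 2000" means `> 2000` or `>= 2000` is not stated, so the strict filter below is an assumption.

```python
import pandas as pd

# Made-up sample using the year_added / year_released names from the text.
df = pd.DataFrame({
    "year_released": [1995, 2005, 2018, 2019, 2016],
    "year_added": [2016, 2017, 2018, 2019, 2019],
})

# Keep titles released after 2000 (strict inequality is an assumption).
recent = df[df["year_released"] > 2000].copy()
# Years between release and arrival on the platform.
recent["lag_years"] = recent["year_added"] - recent["year_released"]

same_year_share = (recent["lag_years"] == 0).mean()
corr = recent["year_added"].corr(recent["year_released"])
print(recent)
print(f"added in release year: {same_year_share:.0%}, correlation: {corr:.2f}")
```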

Notice:

* The largest increase in movies produced is in 2017; for TV shows it is in 2019
* The data was collected as of 2019, so the 2020 figures are incomplete and misleading here.

3. Understanding what content is available for different target audiences (kids, teenagers, adults)

Notice:

* Most titles are for teens and adults; only a small portion is for kids.
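The audience groups have to be derived from the rating column. The mapping below is an assumption made for illustration (the grouping actually used is not shown), as are the column names and the sample rows.

```python
import pandas as pd

# Hypothetical rating-to-audience mapping; adjust to the grouping actually used.
AUDIENCE_BY_RATING = {
    "TV-Y": "Kids", "TV-Y7": "Kids", "G": "Kids", "TV-G": "Kids", "PG": "Kids",
    "PG-13": "Teens", "TV-PG": "Teens", "TV-14": "Teens",
    "R": "Adults", "NC-17": "Adults", "TV-MA": "Adults",
}

# Made-up sample; "NR" is deliberately left out of the mapping.
df = pd.DataFrame({"rating": ["TV-MA", "TV-14", "R", "TV-Y", "TV-MA", "NR"]})

# Ratings missing from the mapping become NaN and drop out of the shares.
df["audience"] = df["rating"].map(AUDIENCE_BY_RATING)
audience_share = df["audience"].value_counts(normalize=True) * 100
print(audience_share.round(1))
```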

Question: How does this look for each type of content?

Notice:

* Both pie charts show that the largest share of content is for adults
* More movies than TV shows have been added for adults and teens.
* More TV shows than movies have been added for kids.
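The numbers behind two such pie charts can be obtained with a cross-tabulation of content type against audience. This is a sketch on made-up rows; the `audience` column is assumed to have been derived from the rating beforehand.

```python
import pandas as pd

# Made-up sample; "audience" is an assumed, previously derived column.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie",
             "TV Show", "TV Show", "TV Show", "TV Show"],
    "audience": ["Adults", "Adults", "Teens", "Kids",
                 "Adults", "Adults", "Kids", "Kids"],
})

# Absolute counts: compare movies with TV shows within one audience group.
counts = pd.crosstab(df["type"], df["audience"])
# Row-wise percentages: one row per pie chart.
shares = pd.crosstab(df["type"], df["audience"], normalize="index") * 100
print(counts)
print(shares.round(1))
```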

4. Finding the correlation between target audience and duration for each kind of content

Notice:

* Most movies run between 80 minutes and 2 hours, and these are mostly for adults.
* Longer movies are made for teenagers and kids.
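Observations like these can be checked with a per-audience summary of movie length. The sketch below uses made-up values; `movie`, `audience`, and `duration_min` are assumed names, with duration already converted to minutes.

```python
import pandas as pd

# Made-up movie sample; duration is assumed to be numeric minutes.
movie = pd.DataFrame({
    "audience": ["Adults", "Adults", "Adults", "Teens", "Kids", "Kids"],
    "duration_min": [90, 100, 110, 130, 85, 140],
})

# Spread of movie length within each audience group.
summary = movie.groupby("audience")["duration_min"].agg(
    ["count", "median", "min", "max"]
)
# Fraction of movies between 80 minutes and 2 hours (bounds inclusive).
in_range = movie["duration_min"].between(80, 120).mean()
print(summary)
print(f"80 to 120 min: {in_range:.0%}")
```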

Action:

1. Do the same to see the pattern in TV show durations

Notice:

* Most TV shows have 1 season. The count decreases as the number of seasons grows, with few shows having 8 to 15 seasons.
* The distribution of TV show length is approximately equal among the 3 audience groups.

5. Exploring the genres of movies and TV shows

Notice:

* There is a negative relationship between Dramas and Documentaries.
* Many Independent and International Movies are also Dramas, and many Action & Adventure movies are also Sci-Fi & Fantasy.

Notice:

* There is a negative relationship between Kids' TV and International TV Shows, but a good number of International TV Shows are Romantic TV Shows
* Many Science and Nature TV shows are also Documentaries.
* There is a positive correlation between TV Horror and TV Mysteries
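Genre relationships of this kind can be measured by turning the comma-separated genre list into 0/1 indicator columns and correlating them. This is a sketch on made-up movie rows; the column name `listed_in` is an assumption.

```python
import pandas as pd

# Made-up sample; each title lists one or more genres separated by ", ".
movies = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Documentaries",
        "Dramas, Independent Movies",
        "Documentaries",
        "Action & Adventure, Sci-Fi & Fantasy",
    ],
})

# One 0/1 column per genre, then the pairwise Pearson correlation.
genres = movies["listed_in"].str.get_dummies(sep=", ")
corr = genres.corr()
print(corr.round(2))
```

A negative value means two genres rarely appear on the same title; a positive value means they often appear together. The same steps apply to a TV show frame.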

Notice:

* Most movies are Documentaries, Stand-Up Comedy, Dramas, and International Movies
* Most TV shows are Kids' TV, International TV Shows, and TV Dramas
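The most common genres can be ranked by splitting the genre list and counting each genre once per title. This is a minimal sketch on made-up rows; `listed_in` is an assumed column name.

```python
import pandas as pd

# Made-up sample of movie genre lists.
movies = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Documentaries",
        "Dramas",
        "Stand-Up Comedy",
    ],
})

# Split the list, give each genre its own row, then count occurrences.
genre_counts = movies["listed_in"].str.split(", ").explode().value_counts()
print(genre_counts)
```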